The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity
Authors
Abstract
Recently, Peres and Shields discovered a new method for estimating the order of a stationary fixed-order Markov chain [15]. They showed that the estimator is consistent by proving a threshold result. While this threshold is valid asymptotically, it is not very useful for DNA sequence analysis, where data sizes are moderate. In this paper we give a novel interpretation of the Peres-Shields estimator as a sharp transition phenomenon. This yields a precise and powerful estimator that quickly identifies the core dependencies in data. We show that it compares favorably to other estimators, especially in the presence of noise and/or variable dependencies. Motivated by this last point, we extend the Peres-Shields estimator to Variable Length Markov Chains. We give an application to the problem of detecting DNA sequence similarity using genomic signatures. Abbreviations: Mk = Fixed order Markov model of order k, PST = Prediction suffix tree, MC = Markov chain, VLMC = Variable length Markov chain.
Similar resources
Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes
The growing amount of information on biological sequences has made statistical approaches necessary for modeling and estimating their functions. In this paper, the sensitivity and specificity of first- and second-order Markov chains for gene prediction were evaluated using complete double-stranded DNA virus genomes. There were two approaches for prediction of each Markov Model parameter,...
Markov Chain Order estimation with Conditional Mutual Information
We introduce the Conditional Mutual Information (CMI) for the estimation of the Markov chain order. For a Markov chain of K symbols, we define CMI of order m, Ic(m), as the mutual information of two variables in the chain being m time steps apart, conditioning on the intermediate variables of the chain. We find approximate analytic significance limits based on the estimation bias of CMI and dev...
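The CMI of order m described in this snippet can be illustrated with a simple plug-in estimate. The following is a minimal sketch, not the cited authors' implementation: the function name `conditional_mutual_information` is hypothetical, the estimate uses raw empirical frequencies, and it omits the significance limits the snippet mentions. It relies on the identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z), applied to windows of m+1 consecutive symbols.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum(v / total * math.log(v / total) for v in counts.values())


def conditional_mutual_information(seq, m):
    """Plug-in estimate of I(X_t; X_{t+m} | X_{t+1}, ..., X_{t+m-1}).

    Hypothetical helper for illustration; assumes m >= 1 and len(seq) > m.
    """
    full, left, right, mid = Counter(), Counter(), Counter(), Counter()
    for i in range(len(seq) - m):
        w = tuple(seq[i:i + m + 1])   # window of m+1 consecutive symbols
        full[w] += 1                  # (X_t, intermediates, X_{t+m})
        left[w[:-1]] += 1             # (X_t, intermediates)
        right[w[1:]] += 1             # (intermediates, X_{t+m})
        mid[w[1:-1]] += 1             # intermediates only
    return _entropy(left) + _entropy(right) - _entropy(mid) - _entropy(full)
```

For a sequence generated by an order-L chain, this quantity should be close to zero for m > L, which is the property an order estimate would exploit; for the alternating string `"01" * 500`, for example, the value at m = 1 is near log 2 while the value at m = 2 is essentially zero.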
Malware Detection using Classification of Variable-Length Sequences
In this paper, a novel graph-based method is proposed for classifying variable-length sequences as a form of feature extraction. The proposed method overcomes the problems that traditional graphs have with variable-length data, without fixing the length of the sequences, by determining the most frequent instructions and placing the remaining instructions in an “other” set, which saves time and memory. Acco...
The consistency of the BIC Markov order estimator
The Bayesian Information Criterion (BIC) estimates the order of a Markov chain (with finite alphabet $A$) from observation of a sample path $x_1, x_2, \ldots, x_n$, as that value $k = \hat{k}$ that minimizes the sum of the negative logarithm of the $k$-th order maximum likelihood and the penalty term $\frac{|A|^k (|A|-1)}{2} \log n$. We show that $\hat{k}$ equals the correct order of the chain, eventually almost surely as ...
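The BIC rule stated in this snippet can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not code from the cited paper: the function name `bic_markov_order` and the `max_order` search bound are hypothetical, and every candidate order is scored on the same positions (from `max_order` onward) so that the likelihoods are comparable.

```python
import math
from collections import Counter


def bic_markov_order(seq, alphabet_size, max_order):
    """Return the k in 0..max_order that minimizes
    -log(max likelihood of order k) + |A|^k (|A| - 1) / 2 * log n.

    Hypothetical helper for illustration; max_order is an assumed search bound.
    """
    n = len(seq)
    best_k, best_score = 0, float("inf")
    for k in range(max_order + 1):
        ctx_counts = Counter()    # occurrences of each length-k context
        trans_counts = Counter()  # occurrences of (context, next symbol)
        for i in range(max_order, n):
            ctx = tuple(seq[i - k:i])
            ctx_counts[ctx] += 1
            trans_counts[(ctx, seq[i])] += 1
        # Maximized log-likelihood using empirical transition frequencies.
        log_lik = sum(
            c * math.log(c / ctx_counts[ctx])
            for (ctx, _sym), c in trans_counts.items()
        )
        penalty = alphabet_size ** k * (alphabet_size - 1) / 2 * math.log(n)
        score = -log_lik + penalty
        if score < best_score:    # strict '<' keeps the smallest minimizer
            best_k, best_score = k, score
    return best_k


print(bic_markov_order("01" * 500, alphabet_size=2, max_order=3))  # prints 1
```

The strictly alternating binary string is fully determined by its previous symbol, so the order-1 likelihood is already perfect and higher orders only add penalty.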
Isotonic Change Point Estimation in the AR(1) Autocorrelated Simple Linear Profiles
Sometimes the relationship between a dependent variable and one or more explanatory variables, known as a profile, is monitored. Simple linear profiles have received more attention than other types of profiles because of their applications, especially in calibration. There are some studies on monitoring them when the observations within each profile are autocorrelated. On the other hand, estimating the change point le...